Down syndrome (DS) is the most common genetic cause of intellectual disability (ID), yet no treatment exists. In this study we used the Ts65Dn mouse model of DS, which is trisomic for orthologs of 55% of the classical protein-coding genes on human chromosome 21. To identify potentially differentially expressed genes, we compared 8 classes of mice defined by genotype, behaviour and treatment.
Down’s syndrome, also known as trisomy 21, is a genetic disorder caused by the presence of all or part of a third copy of chromosome 21. It is usually associated with physical growth delays, mild to moderate intellectual disability, and characteristic facial features. The average IQ of a young adult with Down syndrome is 50, equivalent to the mental ability of an eight- or nine-year-old child. The parents of the affected individual are usually genetically normal. The probability of an affected child increases from less than 0.1% for 20-year-old mothers to 3% for those aged 45. The extra chromosome is believed to occur by chance, with no known behavioral or environmental factor that changes the probability. Revealing particular differentially expressed genes between classes is therefore a useful tool for developing new treatment strategies.
Our analysis requires the following libraries; please check that they are installed.
require(readxl)
require(ggplot2)
require(tidyr)
require(dplyr)
require(cowplot)
require(RColorBrewer)
require(car)
require(multcomp)
require(vegan)
require(ggfortify)
require(factoextra)
require(pca3d)
require(rgl)
require(plotly)
We downloaded the “Mice Protein Expression Data Set” from the UCI Machine Learning Repository in .xls format.
setwd('~/R_mouse')
Mice <- read_xls('Data_Cortex_Nuclear.xls')
Mice$class_factor <- as.factor(Mice$class)
Our data represent the expression levels of 77 proteins measured in 72 mice. For each mouse there are many repeated measurements, so it is useful to add a new column identifying each individual mouse.
Mice$ID <- sapply(strsplit(Mice$MouseID, '_'), function(x) x[1])
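As a side note, a regex-based alternative returns a plain character vector directly, avoiding a list column; a minimal sketch on hypothetical IDs:

```r
# Strip everything from the first underscore onwards;
# sub() returns a character vector, not a list
ids <- c("309_1", "309_2", "311_1")
sub("_.*$", "", ids)  # "309" "309" "311"
```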
We distinguished 8 classes of mice: c-CS-m, c-SC-m, c-CS-s, c-SC-s, t-CS-m, t-SC-m, t-CS-s, t-SC-s. From the data description, the class labels mean:
c-CS-s: control mice, stimulated to learn, injected with saline;
c-CS-m: control mice, stimulated to learn, injected with memantine;
c-SC-s: control mice, not stimulated to learn, injected with saline;
c-SC-m: control mice, not stimulated to learn, injected with memantine;
t-CS-s: trisomy mice, stimulated to learn, injected with saline;
t-CS-m: trisomy mice, stimulated to learn, injected with memantine;
t-SC-s: trisomy mice, not stimulated to learn, injected with saline;
t-SC-m: trisomy mice, not stimulated to learn, injected with memantine
unique(Mice[,c('ID','class_factor')]) %>% group_by(class_factor) %>% count()
## # A tibble: 8 x 2
## # Groups: class_factor [8]
## class_factor n
## <fct> <int>
## 1 c-CS-m 10
## 2 c-CS-s 9
## 3 c-SC-m 10
## 4 c-SC-s 9
## 5 t-CS-m 9
## 6 t-CS-s 7
## 7 t-SC-m 9
## 8 t-SC-s 9
Let us look at the number of measurements in the different groups. There are 15 repeated measurements for each mouse, so at this step it is sufficient to count only unique mice (ignoring repeated measurements). We represented the data in three different ways:
ggplot(data = unique(Mice[,c('ID','class_factor', 'Treatment', 'Genotype', 'Behavior')]), aes(x = class_factor,fill = Genotype)) + geom_bar() + theme_bw() +xlab('Class') + ylab('Number of mice') + ggtitle('Number of mice in different classes with respect to genotype') + scale_fill_brewer(palette = "Set2")
The bar charts make it clear that our groups are not balanced: there are more mice with the control genotype.
Tr <- ggplot(data = unique(Mice[,c('ID', 'Treatment', 'Genotype', 'Behavior')]), aes(x = Genotype,fill = Treatment)) + geom_bar() + theme_bw() +xlab('Genotype') + ylab('Number of mice') + ggtitle('Number of mice with different\ngenotypes with respect to\ntreatment') + scale_fill_brewer(palette = "Set2")
Beh <- ggplot(data = unique(Mice[,c('ID','Treatment', 'Genotype', 'Behavior')]), aes(x = Genotype, fill = Behavior)) + geom_bar() + theme_bw() +xlab('Genotype') + ylab('Number of mice') + ggtitle('Number of mice with different\ngenotypes with respect to\nbehavior') + scale_fill_brewer(palette = "Set2")
plot_grid(Tr, Beh, nrow = 1)
Next it is important to calculate the number of NA values in our data. Only about 51.1% of the rows are free of NAs. That is too much data to discard wholesale, so we will decide how to handle NAs separately in each subsequent analysis.
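The share of complete rows can be computed with `complete.cases()`; a minimal sketch on a toy data frame (the same call on `Mice` yields the ~51.1% quoted above):

```r
# Fraction of rows with no NA in any column, as a percentage
toy <- data.frame(a = c(1, NA, 3, 4), b = c(1, 2, NA, 4))
mean(complete.cases(toy)) * 100  # 50 for this toy example
```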
Firstly, we compared the level of BDNF_N gene expression between classes. To exclude NAs only in the BDNF_N column (together with the ID, Treatment and class columns, of course), we created a smaller dataframe:
BDNF_N_analysis <- na.omit(Mice[,c('ID', 'BDNF_N', 'Treatment', 'class_factor')])
This way we kept rows whose NAs lie only in other columns, and as a result we retained 1077 rows out of 1080. As a first step, we graphically compared BDNF_N expression among classes:
ggplot(data = BDNF_N_analysis, aes(x = class_factor, y = BDNF_N, fill = Treatment )) + geom_boxplot() + scale_fill_brewer(palette = "Set2") + theme_bw() +xlab('Classes') + ylab('BDNF_N expression level') + ggtitle('BDNF_N gene expression level')
First, we applied ANOVA to look for significant differences between classes.
model <- lm(BDNF_N ~ class_factor, BDNF_N_analysis)
model_anova <- Anova(model)
model_anova
## Anova Table (Type II tests)
##
## Response: BDNF_N
## Sum Sq Df F value Pr(>F)
## class_factor 0.28784 7 18.816 < 2.2e-16 ***
## Residuals 2.33619 1069
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Significant differences in BDNF_N expression level between groups were detected (F = 18.816, p < 2.2e-16, df_1 = 7, df_2 = 1069). Of note, ANOVA has applicability conditions: normal distribution of residuals, homogeneity of residual variance, no collinearity (group independence), and randomness and independence of observations within groups.
mod_diag <- fortify(model)
ggplot(mod_diag, aes(x = 1:nrow(mod_diag), y = .cooksd)) + geom_bar(stat = "identity", fill = 'chartreuse3') + ggtitle("Cook's distance graph ") + theme_bw() + xlab('Row number') + ylab('Cook distance')
The Cook’s distances are small, so the model fits well. We then checked the distribution of the residuals:
ggplot(mod_diag, aes(x = class_factor, y = .stdresid, fill = class_factor)) + geom_boxplot() + ggtitle("The distribution of residuals") + theme_bw() + scale_fill_brewer(palette = "Set2") + xlab("Class") + ylab("Standardized residuals")
In our case the residual distributions do not differ substantially among groups, so ANOVA is a suitable tool for our data even with some outliers.
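Homogeneity of variance can also be checked formally with Levene’s test from the car package; a sketch on simulated toy data (on the real data the analogous call would be `leveneTest(BDNF_N ~ class_factor, data = BDNF_N_analysis)`):

```r
library(car)

# Two toy groups drawn from the same distribution
set.seed(3)
d <- data.frame(g = factor(rep(c("a", "b"), each = 20)),
                y = rnorm(40))
leveneTest(y ~ g, data = d)
```

A large p-value here is consistent with homogeneous group variances.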
To identify pairwise differences, Tukey’s post hoc test was performed.
model <- lm(BDNF_N ~ class_factor, BDNF_N_analysis)
post_hoc <- multcomp::glht(model, mcp(class_factor = "Tukey"))
result <- summary(post_hoc)
result
##
## Simultaneous Tests for General Linear Hypotheses
##
## Multiple Comparisons of Means: Tukey Contrasts
##
##
## Fit: lm(formula = BDNF_N ~ class_factor, data = BDNF_N_analysis)
##
## Linear Hypotheses:
## Estimate Std. Error t value Pr(>|t|)
## c-CS-s - c-CS-m == 0 0.0030979 0.0055459 0.559 0.9993
## c-SC-m - c-CS-m == 0 -0.0482717 0.0053980 -8.942 <0.01 ***
## c-SC-s - c-CS-m == 0 -0.0258249 0.0055459 -4.657 <0.01 ***
## t-CS-m - c-CS-m == 0 -0.0264852 0.0055459 -4.776 <0.01 ***
## t-CS-s - c-CS-m == 0 -0.0337570 0.0059483 -5.675 <0.01 ***
## t-SC-m - c-CS-m == 0 -0.0181541 0.0055459 -3.273 0.0240 *
## t-SC-s - c-CS-m == 0 -0.0136310 0.0055790 -2.443 0.2208
## c-SC-m - c-CS-s == 0 -0.0513696 0.0055459 -9.263 <0.01 ***
## c-SC-s - c-CS-s == 0 -0.0289228 0.0056900 -5.083 <0.01 ***
## t-CS-m - c-CS-s == 0 -0.0295831 0.0056900 -5.199 <0.01 ***
## t-CS-s - c-CS-s == 0 -0.0368549 0.0060829 -6.059 <0.01 ***
## t-SC-m - c-CS-s == 0 -0.0212520 0.0056900 -3.735 <0.01 **
## t-SC-s - c-CS-s == 0 -0.0167289 0.0057223 -2.923 0.0686 .
## c-SC-s - c-SC-m == 0 0.0224468 0.0055459 4.047 <0.01 **
## t-CS-m - c-SC-m == 0 0.0217865 0.0055459 3.928 <0.01 **
## t-CS-s - c-SC-m == 0 0.0145147 0.0059483 2.440 0.2228
## t-SC-m - c-SC-m == 0 0.0301176 0.0055459 5.431 <0.01 ***
## t-SC-s - c-SC-m == 0 0.0346406 0.0055790 6.209 <0.01 ***
## t-CS-m - c-SC-s == 0 -0.0006603 0.0056900 -0.116 1.0000
## t-CS-s - c-SC-s == 0 -0.0079321 0.0060829 -1.304 0.8972
## t-SC-m - c-SC-s == 0 0.0076708 0.0056900 1.348 0.8798
## t-SC-s - c-SC-s == 0 0.0121939 0.0057223 2.131 0.3953
## t-CS-s - t-CS-m == 0 -0.0072718 0.0060829 -1.195 0.9331
## t-SC-m - t-CS-m == 0 0.0083311 0.0056900 1.464 0.8260
## t-SC-s - t-CS-m == 0 0.0128542 0.0057223 2.246 0.3246
## t-SC-m - t-CS-s == 0 0.0156029 0.0060829 2.565 0.1697
## t-SC-s - t-CS-s == 0 0.0201260 0.0061130 3.292 0.0226 *
## t-SC-s - t-SC-m == 0 0.0045231 0.0057223 0.790 0.9936
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)
The most significant differences were revealed in the following pairs (the value reported is the estimated difference in means):
c-SC-m & c-CS-m (estimate = -0.0482717, adjusted p < 0.01)
c-SC-s & c-CS-m (estimate = -0.0258249, adjusted p < 0.01)
t-CS-m & c-CS-m (estimate = -0.0264852, adjusted p < 0.01)
t-CS-s & c-CS-m (estimate = -0.0337570, adjusted p < 0.01)
c-SC-m & c-CS-s (estimate = -0.0513696, adjusted p < 0.01)
c-SC-s & c-CS-s (estimate = -0.0289228, adjusted p < 0.01)
t-CS-m & c-CS-s (estimate = -0.0295831, adjusted p < 0.01)
t-CS-s & c-CS-s (estimate = -0.0368549, adjusted p < 0.01)
t-SC-m & c-SC-m (estimate = 0.0301176, adjusted p < 0.01)
t-SC-s & c-SC-m (estimate = 0.0346406, adjusted p < 0.01)
Interestingly, no significant difference was revealed between t-SC-m and c-SC-s, suggesting that memantine can partially normalize BDNF_N expression.
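If needed, the significant pairs can be pulled out of the glht summary programmatically, since the adjusted p-values are stored in the summary object’s `$test$pvalues` field; a hedged sketch on simulated three-group data (the same pattern applies to `result` above):

```r
library(multcomp)

# Simulated data: group "c" is shifted by 2 relative to "a" and "b"
set.seed(2)
d <- data.frame(g = factor(rep(c("a", "b", "c"), each = 10)),
                y = rnorm(30) + rep(c(0, 0, 2), each = 10))
s <- summary(glht(lm(y ~ g, d), linfct = mcp(g = "Tukey")))
# names of contrasts with adjusted p < 0.05
sig <- names(s$test$coefficients)[s$test$pvalues < 0.05]
sig
```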
Next we tried to build a linear model predicting the level of the ERBB4_N protein from the data on the other 76 proteins in the experiment. Let us first count the NAs per protein. There are too many to simply exclude every row containing an NA, so a more careful approach is needed.
FIND_NA <- as.data.frame(is.na(Mice))
Result <- as.data.frame(apply(FIND_NA, 2, sum))
Result
## apply(FIND_NA, 2, sum)
## MouseID 0
## DYRK1A_N 3
## ITSN1_N 3
## BDNF_N 3
## NR1_N 3
## NR2A_N 3
## pAKT_N 3
## pBRAF_N 3
## pCAMKII_N 3
## pCREB_N 3
## pELK_N 3
## pERK_N 3
## pJNK_N 3
## PKCA_N 3
## pMEK_N 3
## pNR1_N 3
## pNR2A_N 3
## pNR2B_N 3
## pPKCAB_N 3
## pRSK_N 3
## AKT_N 3
## BRAF_N 3
## CAMKII_N 3
## CREB_N 3
## ELK_N 18
## ERK_N 3
## GSK3B_N 3
## JNK_N 3
## MEK_N 7
## TRKA_N 3
## RSK_N 3
## APP_N 3
## Bcatenin_N 18
## SOD1_N 3
## MTOR_N 3
## P38_N 3
## pMTOR_N 3
## DSCR1_N 3
## AMPKA_N 3
## NR2B_N 3
## pNUMB_N 3
## RAPTOR_N 3
## TIAM1_N 3
## pP70S6_N 3
## NUMB_N 0
## P70S6_N 0
## pGSK3B_N 0
## pPKCG_N 0
## CDK5_N 0
## S6_N 0
## ADARB1_N 0
## AcetylH3K9_N 0
## RRP1_N 0
## BAX_N 0
## ARC_N 0
## ERBB4_N 0
## nNOS_N 0
## Tau_N 0
## GFAP_N 0
## GluR3_N 0
## GluR4_N 0
## IL1B_N 0
## P3525_N 0
## pCASP9_N 0
## PSD95_N 0
## SNCA_N 0
## Ubiquitin_N 0
## pGSK3B_Tyr216_N 0
## SHH_N 0
## BAD_N 213
## BCL2_N 285
## pS6_N 0
## pCFOS_N 75
## SYP_N 0
## H3AcK18_N 180
## EGR1_N 210
## H3MeK4_N 270
## CaNA_N 0
## Genotype 0
## Treatment 0
## Behavior 0
## class 0
## class_factor 0
## ID 0
Now it is clear that for a better prediction it is necessary to exclude the proteins with many missing values: BAD_N, BCL2_N, pCFOS_N, H3AcK18_N, EGR1_N, H3MeK4_N.
exclude_names <- c("BAD_N", "BCL2_N", "pCFOS_N", "H3AcK18_N", "EGR1_N", "H3MeK4_N")
Mice_good <- Mice[,!colnames(Mice) %in% exclude_names]
Mice_good <- na.omit(Mice_good)
After this step we have 1047 rows, which is good because we kept more data than we would otherwise have lost. To build a good linear model, let us first calculate a correlation matrix.
Mice_good_numeric <- Mice_good[,2:72]
Cor_matrix <- cor(Mice_good_numeric)
We then looked for proteins that are strongly correlated with each other (absolute correlation coefficient of 0.7 or more) and excluded them.
cor_vector <- vector()
Cor_matrix_without_ERBB4_N <- Cor_matrix[!row.names(Cor_matrix) %in% 'ERBB4_N', !colnames(Cor_matrix) %in% 'ERBB4_N']
for (i in 1:ncol(Cor_matrix_without_ERBB4_N)){
  for (j in 1:nrow(Cor_matrix_without_ERBB4_N)){
    # note: abs() must wrap the coefficient, not the comparison,
    # so that strong negative correlations are also caught
    if (abs(Cor_matrix_without_ERBB4_N[i,j]) > 0.7 & i != j){
      cor_vector <- c(cor_vector, i, j)
    }
  }
}
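The same filter can be written without explicit loops; a sketch on a small deterministic toy matrix:

```r
# x and y are nearly collinear; z is weakly correlated with both
z <- rep(c(1, -1), 5)
toy <- cbind(x = 1:10, y = 1:10 + 0.1 * z, z = z)
cm <- cor(toy)
diag(cm) <- 0                        # ignore self-correlations
hits <- which(abs(cm) > 0.7, arr.ind = TRUE)
unique(colnames(cm)[hits[, "col"]])  # "x" "y"
```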
# proteins with these indices should be excluded
cor_vector <- unique(cor_vector)
proteins_exclude <- row.names(Cor_matrix_without_ERBB4_N[cor_vector,])
protein_predictors <- colnames(Cor_matrix_without_ERBB4_N)[!colnames(Cor_matrix_without_ERBB4_N) %in% proteins_exclude]
protein_predictors
## [1] "pCAMKII_N" "pNR2A_N" "pRSK_N" "APP_N"
## [5] "SOD1_N" "pP70S6_N" "P70S6_N" "pGSK3B_N"
## [9] "pPKCG_N" "CDK5_N" "S6_N" "ADARB1_N"
## [13] "RRP1_N" "BAX_N" "nNOS_N" "GFAP_N"
## [17] "GluR3_N" "GluR4_N" "IL1B_N" "P3525_N"
## [21] "pCASP9_N" "PSD95_N" "SNCA_N" "Ubiquitin_N"
## [25] "pGSK3B_Tyr216_N" "SHH_N" "SYP_N" "CaNA_N"
Using 27 of these 28 predictors (GluR4_N was omitted), we fitted a linear model:
m1 <- lm(ERBB4_N ~ pCAMKII_N + pNR2A_N + pRSK_N + APP_N + SOD1_N + pP70S6_N + P70S6_N + pGSK3B_N + pPKCG_N + CDK5_N + S6_N + ADARB1_N + RRP1_N + BAX_N + nNOS_N + GFAP_N + GluR3_N + IL1B_N + P3525_N + pCASP9_N + PSD95_N +SNCA_N + Ubiquitin_N +pGSK3B_Tyr216_N + SHH_N + SYP_N + CaNA_N, data = Mice_good)
summary(m1)
##
## Call:
## lm(formula = ERBB4_N ~ pCAMKII_N + pNR2A_N + pRSK_N + APP_N +
## SOD1_N + pP70S6_N + P70S6_N + pGSK3B_N + pPKCG_N + CDK5_N +
## S6_N + ADARB1_N + RRP1_N + BAX_N + nNOS_N + GFAP_N + GluR3_N +
## IL1B_N + P3525_N + pCASP9_N + PSD95_N + SNCA_N + Ubiquitin_N +
## pGSK3B_Tyr216_N + SHH_N + SYP_N + CaNA_N, data = Mice_good)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.033468 -0.004725 0.000252 0.004740 0.036943
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.093e-02 4.989e-03 4.196 2.95e-05 ***
## pCAMKII_N -8.197e-05 4.264e-04 -0.192 0.847592
## pNR2A_N -8.699e-03 3.019e-03 -2.882 0.004035 **
## pRSK_N 3.135e-02 7.065e-03 4.438 1.01e-05 ***
## APP_N -6.121e-03 5.243e-03 -1.168 0.243263
## SOD1_N 9.596e-04 1.458e-03 0.658 0.510641
## pP70S6_N 5.460e-03 3.517e-03 1.553 0.120843
## P70S6_N -5.723e-03 2.952e-03 -1.939 0.052806 .
## pGSK3B_N 6.790e-02 2.550e-02 2.663 0.007869 **
## pPKCG_N -6.626e-03 9.320e-04 -7.110 2.19e-12 ***
## CDK5_N 2.381e-02 9.709e-03 2.452 0.014374 *
## S6_N 1.477e-02 3.230e-03 4.574 5.38e-06 ***
## ADARB1_N 8.543e-03 1.093e-03 7.817 1.34e-14 ***
## RRP1_N -2.348e-02 9.940e-03 -2.362 0.018375 *
## BAX_N 7.933e-02 2.131e-02 3.722 0.000208 ***
## nNOS_N 5.520e-02 1.674e-02 3.296 0.001013 **
## GFAP_N -4.910e-02 3.199e-02 -1.535 0.125105
## GluR3_N -6.620e-02 1.019e-02 -6.496 1.28e-10 ***
## IL1B_N 6.890e-02 6.462e-03 10.663 < 2e-16 ***
## P3525_N 1.003e-01 1.560e-02 6.433 1.92e-10 ***
## pCASP9_N 8.406e-03 1.993e-03 4.217 2.69e-05 ***
## PSD95_N 1.526e-02 1.760e-03 8.674 < 2e-16 ***
## SNCA_N -5.001e-03 2.178e-02 -0.230 0.818478
## Ubiquitin_N -6.014e-04 3.228e-03 -0.186 0.852241
## pGSK3B_Tyr216_N 1.496e-03 4.631e-03 0.323 0.746704
## SHH_N -7.175e-03 1.212e-02 -0.592 0.554033
## SYP_N 2.673e-02 6.223e-03 4.295 1.91e-05 ***
## CaNA_N -9.209e-03 2.138e-03 -4.307 1.82e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.007815 on 1019 degrees of freedom
## Multiple R-squared: 0.7422, Adjusted R-squared: 0.7354
## F-statistic: 108.7 on 27 and 1019 DF, p-value: < 2.2e-16
The model is quite good (adjusted R-squared: 0.7354). We tried to refine the model further, but the attempts did not improve it (they can be inspected in the .Rmd file by changing the include value in the corresponding chunk).
Next it is important to diagnose the resulting linear model. Let us plot the Cook’s distances, as before:
mod_diag_2 <- fortify(m1)
ggplot(mod_diag_2, aes(x = 1:nrow(mod_diag_2), y = .cooksd)) + geom_bar(stat = "identity", fill = 'chocolate1') + ggtitle("Cook's distance graph ") + theme_bw() + xlab('Row number') + ylab('Cook distance')
We also plotted the distribution of the model residuals:
predict_value <- m1$fitted.values
Mice_good$DIF <- predict_value - Mice_good$ERBB4_N
ggplot(data = Mice_good, aes (x = 1:nrow(Mice_good), y = DIF)) + geom_point() + theme_bw() + geom_hline(yintercept = 0, col = 'red') + xlab('Row number') + ylab('Difference between predicted and real value')
To conclude, the model is satisfactory: there are no visible patterns among the residuals, the adjusted R-squared is 0.7354, and the F-statistic is 108.7 on 27 and 1019 DF with p < 2.2e-16.
We performed PCA to reduce dimensions:
proteins.pca <- prcomp(Mice_good_numeric, center = TRUE,scale. = TRUE)
summary(proteins.pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 4.4503 3.4072 2.8105 2.33899 1.86301 1.82428 1.55734
## Proportion of Variance 0.2789 0.1635 0.1113 0.07705 0.04888 0.04687 0.03416
## Cumulative Proportion 0.2789 0.4425 0.5537 0.63077 0.67965 0.72652 0.76068
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 1.45602 1.28359 1.11052 1.01509 0.94326 0.85416 0.84546
## Proportion of Variance 0.02986 0.02321 0.01737 0.01451 0.01253 0.01028 0.01007
## Cumulative Proportion 0.79054 0.81375 0.83112 0.84563 0.85816 0.86844 0.87851
## PC15 PC16 PC17 PC18 PC19 PC20 PC21
## Standard deviation 0.79527 0.75263 0.70358 0.68280 0.64661 0.63958 0.61419
## Proportion of Variance 0.00891 0.00798 0.00697 0.00657 0.00589 0.00576 0.00531
## Cumulative Proportion 0.88741 0.89539 0.90236 0.90893 0.91482 0.92058 0.92589
## PC22 PC23 PC24 PC25 PC26 PC27 PC28
## Standard deviation 0.60263 0.56478 0.5461 0.53550 0.51218 0.48899 0.47211
## Proportion of Variance 0.00511 0.00449 0.0042 0.00404 0.00369 0.00337 0.00314
## Cumulative Proportion 0.93101 0.93550 0.9397 0.94374 0.94744 0.95080 0.95394
## PC29 PC30 PC31 PC32 PC33 PC34 PC35
## Standard deviation 0.44843 0.43174 0.42033 0.4128 0.40859 0.38049 0.37469
## Proportion of Variance 0.00283 0.00263 0.00249 0.0024 0.00235 0.00204 0.00198
## Cumulative Proportion 0.95677 0.95940 0.96189 0.9643 0.96664 0.96868 0.97066
## PC36 PC37 PC38 PC39 PC40 PC41 PC42
## Standard deviation 0.35654 0.34868 0.34328 0.33609 0.33136 0.32733 0.31632
## Proportion of Variance 0.00179 0.00171 0.00166 0.00159 0.00155 0.00151 0.00141
## Cumulative Proportion 0.97245 0.97416 0.97582 0.97741 0.97896 0.98047 0.98188
## PC43 PC44 PC45 PC46 PC47 PC48 PC49
## Standard deviation 0.29736 0.2917 0.28820 0.28157 0.27362 0.2671 0.25851
## Proportion of Variance 0.00125 0.0012 0.00117 0.00112 0.00105 0.0010 0.00094
## Cumulative Proportion 0.98312 0.9843 0.98549 0.98661 0.98766 0.9887 0.98961
## PC50 PC51 PC52 PC53 PC54 PC55 PC56
## Standard deviation 0.24793 0.24498 0.23645 0.22764 0.22103 0.21583 0.21045
## Proportion of Variance 0.00087 0.00085 0.00079 0.00073 0.00069 0.00066 0.00062
## Cumulative Proportion 0.99047 0.99132 0.99210 0.99283 0.99352 0.99418 0.99480
## PC57 PC58 PC59 PC60 PC61 PC62 PC63
## Standard deviation 0.2063 0.19942 0.19206 0.1891 0.18430 0.17125 0.16071
## Proportion of Variance 0.0006 0.00056 0.00052 0.0005 0.00048 0.00041 0.00036
## Cumulative Proportion 0.9954 0.99596 0.99648 0.9970 0.99746 0.99788 0.99824
## PC64 PC65 PC66 PC67 PC68 PC69 PC70
## Standard deviation 0.15930 0.15640 0.13731 0.12629 0.12347 0.1198 0.10325
## Proportion of Variance 0.00036 0.00034 0.00027 0.00022 0.00021 0.0002 0.00015
## Cumulative Proportion 0.99860 0.99894 0.99921 0.99943 0.99965 0.9999 1.00000
## PC71
## Standard deviation 1.976e-15
## Proportion of Variance 0.000e+00
## Cumulative Proportion 1.000e+00
To visualize the percentage of the total variation explained by each of the first 9 components, we created a plot:
Importance <- summary(proteins.pca)
DF_importance <- as.data.frame(t(as.matrix(Importance$importance)))
DF_importance$PC <- rownames(DF_importance)
DF_importance$Variance <- DF_importance$`Proportion of Variance`
ggplot(data=DF_importance[1:9,], aes(x = PC, y = Variance)) + geom_bar(stat = "identity") + theme_bw() + xlab("Principal component") + ylab("Percentage of the total variation")+ ggtitle("A percentage of the total variation\nfor the first 9 PC")
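A quick way to read off how many components are needed for a given share of variance is to take the cumulative sum of the squared standard deviations; a sketch on random toy data (applying the same `cumsum` line to `proteins.pca$sdev` reproduces the cumulative proportions printed above):

```r
set.seed(1)
toy <- matrix(rnorm(100 * 5), ncol = 5)
p <- prcomp(toy, center = TRUE, scale. = TRUE)
# cumulative proportion of variance explained
cumvar <- cumsum(p$sdev^2) / sum(p$sdev^2)
which(cumvar >= 0.9)[1]  # first PC reaching 90% cumulative variance
```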
To see the differences between groups more clearly, we also made a 2D plot (PC1 & PC2):
autoplot(proteins.pca, data = Mice_good, colour = 'class') + theme_bw()
And an interactive 3D plot (PC1, PC2 & PC3):
plot_ly(x=proteins.pca$x[,1], y=proteins.pca$x[,2], z=proteins.pca$x[,3], type="scatter3d", mode="markers", color = Mice_good$class_factor)
We have too many proteins, so the factor-loading graph is hard to read. The angles between the vectors reflect the correlations of the features with each other and with the principal component axes.
fviz_pca_var(proteins.pca,
col.var = "contrib", # Color by contributions to the PC
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE
)